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Abstract. The longest common subsequence problem is a long studied prototype of pattern matching 
problems. In spite of the effort dedicated to it, the numerical value of its central quantity, the Chvatal- 
Sankoff constant, is not yet known. Numerical estimations of this constant are very difficult due to finite 
size effects. We propose a numerical method to estimate the Chvatal-Sankoff constant which combines the 
advantages of an analytically known functional form of the finite size effects with an efficient multi-spin 
coding scheme. This method yields very high precision estimates of the Chvatal-Sankoff constant. Our 
results correct earlier estimates for small alphabet size while they are consistent with (albeit more precise 
than) earlier results for larger alphabet size. 
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1 Introduction 



Pattern matching problems have been of central interest in 
such different areas of science as image processing, speech 
, recognition, time series analysis, and biological sequence 
comparison for a large number of years. In many practical 
applications of pattern matching, it is important to be able 
to characterize the amount and/or strength of patterns 
found in random data in order to be able to distinguish 
, meaningful patterns from random ones . This requires a 
quantitative theory of pattern matching procedures which 
does not only characterize universal characteristics like 
scaling laws but which also provides specific numerical 
values for non- universal quantities. 

One of the simplest pattern matching problems is the 
longest common subsequence (LCS) problem. It is a spe- 
, cial case of the sequence comparison problem, which is a 
■ very important tool in modern molecular biology 0. As 
\ a prototype for pattern matching it has enjoyed the at- 
tention of mathematicians and computer scientists for a 
long time One of the central 

quantities in the LCS problem is the average fraction of 
matches in the longest common subsequence of two very 
long random strings also known as the Chvatal-Sankoff 
constant a c ||. However, in spite of the relative simplicity 
of the LCS problem and the long time it has been studied, 
the numerical value of the Chvatal-Sankoff constant is still 
unknown. There has been a long standing conjecture 

■> 2 



by Arratia |l4J for its value based on numerical simu- 
lations where c is the size of the alphabet from which 
the sequences are chosen at random. This conjecture has 
been recently shown to hold for a first-passage percola- 
tion version of the LCS problem using the analogy [ p"5| to 
surface growth as described by the Kardar-Parisi-Zhang 
(KPZ @) equation and a cavity method In 
the language of surface growth the Chvatal-Sankoff con- 
stant takes the interpretation of the average growth veloc- 
ity of the surface or of the ensemble averaged free energy 
per length in the directed polymer picture which is equiv- 
alent to the KPZ equation. 

For the original Chvatal-Sankoff constant (i.e., not the 
first-passage percolation version) a variety of upper and 
lower bounds have been rigorously proven in the mathe- 
matical literature J|,f|,[^||,jl^,[llj which are all consistent 
with Arratia's conjecture. Short of a rigorous result for 
the value of the Chvatal-Sankoff constant, extensive nu- 
merical studies are the only way to shed some light on the 
validity of Arratia's conjecture. The difficulty with numer- 
ical studies is that the Chvatal-Sankoff constant is defined 
as an asymptotic quantity in the limit that the sequences 
the longest common subsequence of which is computed are 
infinite. Simulations at finite sequence length always ob- 
serve finite size effects the size and even functional form 
of which is not known. In fact, two recent numerical stud- 
ies (Tl],|l^,|l3| assume different functional forms of the fi- 
nite size effect. Thus, they come to different conclusions 
about the value of the Chvatal-Sankoff constant although 
both studies agree on the fact that the Chvatal-Sankoff 
constant differs from its conjectured value Eq. (jl|). 



2 



R. Bundschuh: High precision simulations of the longest common subsequence problem 



In this communication, we will remedy this dilemma 
in two different ways. First, we will use a slightly different 
geometry in which we perform our simulations. The ad- 
vantage of this geometry is that the form of the finite size 
behavior is theoretically known from the analogy to surface 
growth |l5j and does not have to be inferred from the nu- 
merical data. We will prove this finite size behavior for the 
first-passage percolation case explicitly. While this result 
reproduces the known value of the first-passage percola- 
tion version of the Chvatal-Sankoff constant, it addition- 
ally provides the complete finite-size correction in terms of 
Legendre polynomials and an explicit formula for the pref- 
actor of the leading finite size correction term. We will use 
this to validate our numerical procedure. Additionally, we 
will reformulate the LCS problem in a way that allows for 
a multi-spin coding scheme, in which 64 lattice sites are 
updated in parallel by simple bitwise logical operations. 
This extremely efficient code allows us to average over a 
large number of systems with several million letters each. 
This yields the value of the Chvatal-Sankoff constant with 
a precision of one part in 10 6 . 



2 Longest Common Subsequence problem 

The LCS problem is defined as follows: Given two se- 
quences x = x\ . . . Xn and y = yi • • - J/jv of letters Xi and 
Dj independently chosen with equal probabilities from an 
alphabet of size c, find their longest common subsequence. 
A common subsequence of length £ is given by a set of 
(*fcijfc)'s with 1 < i\ < . . . < ii < N and 1 < j\ < 
• • • < jt < N such that x ik = yj k for all k G {1,. ..,£}, 
We denote the length of the longest common subsequence 
of two given sequences by L(N). It is a random variable 
since it depends on the specific choice of the sequences. 
Its expectation value averaged over the ensemble of all 
sequence pairs is (L(N)). Chvatal and Sankoff noted || 
that this expectation value is asymptotically proportional 
to the length of the sequences, i.e., (L(N)) « a c N. The 
prefactor a c is now called the Chvatal-Sankoff constant 
and depends only on the size c of the alphabet. To find 
high precision estimates for this value for different alpha- 
bet sizes c is the purpose of this publication. 

Numerically, the length of the longest common sub- 
sequence of two given sequences can be calculated very 
efficiently using a transfer matrix algorithm. This trans- 
fer matrix algorithm calculates the auxiliary quantity £ij 
which is defined to be the length of the longest common 
subsequence of the substrings x\ . . . Xi and yi.-.yj. The 
recursion for £ij is 

tij = max{^_ij-i + rjij^i-ij^ij-i} (2) 

where 

77- = / 1 if Xl = V: > (3) 

n - J 1 otherwise ' v ' 

This recursion is completed by the initial conditions 

tifi = 4>J = 0. (4) 




(a) (b) 



Fig. 1. Lattice representation of the longest common subse- 
quence problem, (a) shows how the two sequences ABBBA 
and AABAB are written at the edges of the lattice. The hori- 
zontal bonds of the lattice correspond to matches with a weight 
one (solid lines) or to mismatches with a weight zero (dashed 
lines.) The diagonal bonds correspond to omitted letters. Each 
common subsequence corresponds to a path from the leftmost 
to the rightmost point of the lattice. The length of the com- 
mon subsequence is the sum over all weights of the bonds which 
the path passed. The path shown as an example has a weight 
of three and corresponds to the maximal common subsequence 
ABA. The figure also shows how we use the coordinates r and t 
in order to denote lattice points and how the diamond shaped 
lattice can be split into two triangular lattices (long dashed 
line.) (b) shows the rectangular lattice which we use instead 
of the lattice of (a) in order to be able to handle the effects of 
the finite lattice size. 



The length of the longest common subsequence of the orig- 
inal strings x and y is then L(N) = £n,n- The recursion 
equation can be visualized by assigning the l^j to the 
nodes of the lattice shown in Fig. |](a) Jl^,|l9). The hori- 
zontal bonds then correspond to the Tjij and the diagonal 
bonds correspond to omitting a letter from one of the se- 
quences. The recursive algorithm calculates the path from 
the origin to the end point of this lattice with the largest 
weight as given by the sum over the bond weights through 
which the path passes. This is clearly a discrete model of a 
directed polymer (the path) in a random medium |l5[ ] (the 
random matches and mismatches f]i t j.) To make this more 
explicit, we will use "spatial" and "temporal" coordinates 
r and t instead of the indices i and j to refer to lattice 
points as shown in Fig. |l|(a). In this coordinate system the 
dynamics given by the recursive algorithm reads 

( £(r,t-l)+r]{r,t) ) 
£{r,t + 1) = max £(r - l,t) >. (5) 

(£(r + l,t) J 

The total length L(N) = £{0, 2N) of the longest com- 
mon subsequence corresponds to the total free energy of 
the directed polymer. This is clearly an extensive quan- 
tity and the proportionality constant, the free energy per 
length, is the Chvatal-Sankoff constant. It can in princi- 
ple be calculated numerically by running the algorithm (||) 
for a large number of sequences and averaging the free en- 
ergies obtained. As already mentioned the difficulty with 
this procedure is that the free energy has terms which are 
subleading in the length N of the sequences. Thus, the ra- 
tio (L(N)) /N only converges towards the Chvatal-Sankoff 
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constant a c as the sequences become very long. For high 
precision measurements, the rate of this convergence has 
to be taken into account for a proper extrapolation to- 
wards infinite sequence lengths. This is relatively hard for 
the diamond shaped lattice, since there is no theoretical 
understanding of its finite size effects. The only reliable 
result is a bound f9) (L(N))/N < a c + 0{N-^ 2 log AT) 
on the finite size dependence. Dancfk [[ll] assumed that 
this bound represents the exact functional form of the fi- 
nite size correction and used it to fit his numerical data 
generated from four pairs of sequences of length up to one 
million to arrive at a value of a c= 2 — 0.81225 ± 0.00025 
for a two letter alphabet. More recently, Boutet de Monvel 
tried to extract the finite size behavior directly from the 
numerical data without any theoretical foundation and 
concluded (L(N))/N » a c + OiN- 1 / 2 / logAQwhich lead 
him to a valuefl of a c=2 = 0.81233 ± 0.00005 ||,|l|. 



3 LCS on a rectangular lattice 

In light of the confusion on the handling of finite size ef- 
fects, we will consider a rectangular lattice of a given width 
W as the one shown in Fig. [l](b). In order to minimize the 
finite size effects, we apply periodic boundary conditions 
in the spatial (vertical) direction. For any fixed W the 
dynamics given by Eq. (||) will after some startup phase 
become stationary which allows us to measure 



a c (W) = lim -{£(W- l,2t)) 

t— >oo t 



(6) 



1 1 



w-i 



t^oo t 2W * — ' 

k=0 

with nearly arbitrary precision if we use long enough se- 
quences (note also, that in this geometry we have to per- 
form only WN instead of N 2 elementary steps as the one 
given in Eq. (0) on the diamond shaped lattice which al- 
lows for much larger sequence lengths N.) This geometry 
directly corresponds to the finite size geometry usually im- 
posed in studies of directed polymers in a random medium 
and equivalent KPZ surface growth problems pCfl . From 
these studies it is well known 21 , that 



a c (W) = a c (oo) 



IF 



higher order terms (7) 



with some constant b c . We argue in App. [A| that 



a c (oo) = lim a c (W) = a c 

W—tao 



(8) 



i.e., that calculating a c (W) for fixed W and taking the 
limit W — * oo actually yields the Chvatal-Sankoff con- 
stant a c . The advantage of this procedure over other nu- 
merical approaches is that we know the functional form 

1 Boutet de Monvel does not give an actual statistical error of 
his estimates. Instead he gives three different estimates for the 
Chvatal-Sankoff constant at each alphabet size. These three 
values stem from three different ways to treat the finite size 
effects. The statistical error cited here reflects the spread of 
these three values. 



of the finite size dependence of a c (W) in advance and can 
extract the Chvatal-Sankoff constant a c by fitting numer- 
ically estimated a c (W) to the functional form 



a c (W) 



W 



(9) 



At this point we want to remark for the reader familiar 
with the surface growth analogy of the LCS problem that 
in the surface growth language there are two finite size 
correction formulae known [ [2l| . If the surface grows for a 
very long time t but is restricted to a width W <§; t 2 / 3 
Eq. (0) applies. In the opposite regime, i.e., for W » t 2 / 3 , 
there is a different scaling behavior which now depends on 
t instead of W. It appears that the diamond shaped lattice 
of the original LCS problem is in this transient phase since 
we have t = 2W. However, the known finite size correction 
formula in this regime applies only if the surface grows 
from a flat substrate while the diamond lattice forms a 
wedged substrate leading to the more difficult finite size 
effects observed numerically by Boutet de Monvel |p^ , ^3| . 
It turns out, though, that the flat substrate can also be 
translated into the LCS language: It requires cutting the 
diamond shaped lattice of size t into two triangles as in- 
dicated by the long dashed line in Fig. |l|(a), applying the 
recursion equation (||) with the diamond boundary condi- 
tions Eq. (|4) within the left triangle, and then calculating 



a c (t) 



max 

-t+2,.. 



.t-2,t} 



£(r,t)). 



(10) 



It can be shown p9[ that lim^oo a c (t) — a c and the finite 
size correction should for this geometry be |21j| 



higher order terms. 



(11) 



This explicit finite size dependence formula could also 
be used in order to get high precision estimates of the 
Chvatal-Sankoff constant. However, as we will see below 
the rectangular geometry is advantageous in terms of the 
numerical evaluation of the recursion equation (^|). Thus, 
we will use the rectangular geometry in our current study. 

For a better theoretical understanding of the longest 
common subsequence problem on the rectangular lattice 
and in order to formulate a very efficient algorithm for its 
simulation it is helpful to concentrate on the differences 
between neighboring £(r, t). In order to preserve the sym- 
metry of the lattice shown in Fig. 0(b) we define them 



A(r,t) 



_ / £(r, t + 1) - £(r + 1, t) for r + t even 
i(r + 1, t + 1) - £(r, t) for r + t odd 



(12) 



From the recursion equation (^) it can be easily checked 
by induction that A(r, t) 6 {0, 1}. They can thus be con- 
sidered Boolean variables. In terms of these quantities, the 
recursion equation (|5|) can be rewritten as 



rj{r,t) - A{r,t - 1) 
A(r, t)=max{ A(r - 1, t - 1) - A(r, t - 1) 




(13) 
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t)(r, t) 

Z\(r-l,t)=max{ A(r,t 




A(r 1) 
l)-A(r-l,t- 



1» (14) 



for even r + 1. Note, that this formulation does not make 
any reference to the absolute lengths £(r, t) of the longest 
common subsequences. The dynamics of the LCS problem 
is thus completely captured by the dynamics of the length 
differences as given by Eqs. ([ll]) and ([l4j). Every configu- 
ration of the £{r, t) can be equally well represented by a 
configuration of the A(r,t). Reversely, a configuration of 
the A(r,t) represents a configuration of the £(r,t), if the 
consistency condition 



2W-1 



]T (-l)M(r,t) = 



(15) 



r=0 



for the validity of the periodic boundary conditions in 
£(r, t) is fulfilled. The alternating sum on the left hand 
side of Eq. ( Jl5| ) is conserved under the dynamics given 
by Eq. (|l|) and (|lj) and is thus always fulfilled, if it is 
fulfilled for the initial choice of the A(r,t = 0). 

Slaved to the dynamics of the A is the evolution of the 
average length 



£(t) 



w-i 



2W 



^2 [£(2k,t-l)+£(2k+l,t)] for t even 



k=0 

w-i 



2W 



[£(2k,t) + £{2k+l,t-l)] for t odd 



k=0 



of the longest common subsequence which changes by 

* + '>-W-^£{$ft\, t ,£!Er M 

k— 

with 

(v(r,t) ) 
S(r,t)=£(r,t+l)-£(r,t-l)=max{A(r,t-l) } (17) 

(A(r-l,t-l)j 

in every time step. Together with Eq. M) it thus yields 



a c (W) = (6(r,t)}. 



(18) 



The rewriting of a c (W) in terms of (5) opens the possi- 
bility of a very fast implementation of the longest common 
subsequence problem. We note that Eqs. (|l3|) and ( [l4|) 
transform a pair of length differences into another pair of 
length differences independent of the other pairs at the 
same time t. According to the lattice structure shown in 
Fig. |l](b) this happens for different ways of pairing at 
different times. If we interpret the zi(r, t) and n(r,t) as 
Boolean variables, we can give Eqs. (13) and dll) the form 



A(r, £)=[-■ A(r,t-1)] A [rj(r,t)V A(r-l,t-l)} (19) 
A(r-1, *)=[-■ A(r-l,t-l)] A [??(r,t)V A(r,t-1)} (20) 

Thus, we store 32 of the A(r, t) for even r and the corre- 
sponding 32 of the A(r, t) for odd r simultaneously as the 



single bits of one integer variable each. 32 of the match- 
mismatch variables r\ (r, t) can be stored in the same way. 
Then, Eqs. (19) and ( |20| ) can be performed in parallel on 
these 64 lattice sites by bitwise logical operations. After 
these operations the odd sites are shifted by one bit to the 
right so that they pair up with their new even site part- 
ners and the same procedure is repeated. After shifting 
the odd sites one bit back to the left the algorithm can 
proceed from the beginning 

In order to measure a c (W) we note that the increase 
S(r, t) in Eq. (fl6|) is given by the Boolean equation 



(21) 



5{r, t) = T)(r, t) V A(r, t- 1) V A(r- 1, t- 1) 



and can thus also be easily calculated for 32 pairs of sites 
with two bitwise or operations. A bit-count then gives the 
value of a c (W) after averaging over many time steps and 
configurations of the disorder rj(r,t). 



4 First passage percolation 

We verify the algorithm and the finite size behavior given 
by Eq. (0) on the first passage percolation version of the 
LCS problem. For this system we are not only able to 
prove explicitly the scaling form Eq. (|^) but we can calcu- 
late the full finite size correction and specifically the pref- 
actor b c of its leading term. The first passage percolation 
version of the LCS problem also obeys the dynamics given 
by Eqs. ( |l3| ) and (Q). The only difference is, that instead 
of choosing random sequences x\ . . . xn and y\ . . . tjn and 
calculating the local matches rj(r, t) from these sequences, 
we use the variables r}(r, t) which are taken to be indepen- 
dent identically distributed random variables with 



^, , _ J 1 with probab. 1/c 

mr, t) - < with probab 1 _ 



1/c 



(22) 



The analog a c of the Chvatal-Sankoff constant for this 
problem is known to be a c = 1j{yfc + 1) which had been 
conjectured by Arratia for the a c of the LCS problem. 
This result follows, e.g., as a consequence of an exact map- 
ping [ ^2| between the LCS problem and the discrete time 
asymmetric exclusion process [ p3[|24| ] , a well studied ex- 
actly solvable model of the KPZ surface growth univer- 
sality class. (The mapping basically interprets the length 
differences A(r, t) as the site occupation numbers of the 
asymmetric exclusion process.) Below, we will derive the 
result a c = 2/(y / c+l) and its finite size correction without 
explicit reference to this mapping. 

Eqs. @ and © can be interpreted in terms of a pair 
(A(r-l 7 t-l) } A(r,t-l)) E {(0, 0), (0, 1), (1, 0), (1, 1)} 
mapped onto another pair (A(r — l,t),A(r,t)) G {(0,0), 
(0,1), (1,0), (1,1)} with probabilities depending on the 
possible values of rj(r, t) . This mapping can be written 
as the matrix 

/I 



10 

V h ooo, 



i o o r 

c 1 



(23) 
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in the basis |00), 1 01) , |10), |11). A distribution of possi- 
ble states of the system can be interpreted as a normalized 
vector in the 2 2W dimensional space of all possible con- 
figurations of the 2W variables A(r, t) at a fixed time. In a 
time step with even t, each neighboring pair of length dif- 
ferences is transformed according to the matrix To. Thus, 
the whole state distribution is transformed into the 
vector T\ip) with 

w 



T 



l Tn. 



(24) 



At odd time steps the pairing of the length differences is 
different. In order to describe the transformation at odd 
time steps, we introduce the translation operator C, which 
shifts all length differences A(r, t) by one spatial position 
to the right, i.e., 



C\A 2W -i ■ ■ ■ A t Ao) = \A A 2W ~i ..-Ai) 



(25) 



The time evolution at odd times is then given by a shift 
of all length differences to the left, application of the even 
time evolution operator T and a shift of all length dif- 
ferences back to the right, i.e., by CTC^ 1 . After two 
time steps, the state vector \tp) is thus transformed into 
the state vector CTC~ 1 T\ip). The stationary state is the 
eigenvector \i[>o) of CTC~ X T with the eigenvalue one which 
is consistent with the condition (|l5|). It is easy to check, 
that this eigenvector is given by 



2W-1 

{A,-£(-i)*A=o} 

i=0 



2W-1 



with the combinatorial normalization factor 



fi=0 



c 



c- 1 



(26) 



(27) 



where J\y is the Legendre polynomial of degree W. Since 
CTC~ 1 T restricted to the subspace of state vectors con- 
sistent with condition (|l^) is irreducible and neither the 
eigenvector \ipo) nor CTC~ 1 T have any negative entries, 
it is the unique stationary state of the system according 
to the Perron Frobenius theorem pq] . 

The analog a c (W) of the Chvatal-Sankoff constant for 
this modified system can be calculated from the knowledge 
of this stationary state via Eqs. ( [l8l) and (^ll). Since rj(r, t) 
is uncorrelated from A(i — 1, t — 1) and A(r,t— 1) we get 

a c (W) = (ff(r, t) V A(r, t - 1) V A(r - 1, t - 1)> (28) 

= l-(l-l) ([n4(r ) t-l)]Ah4(r-l,t-l)]) 



= 1- I 1-- 1 Pr{A(r,t-l) = A(r-l,t-l) = 0}. 

The probability that two neighboring length differences 
vanish simultaneously can be directly calculated from the 



explicitly known stationary state Eq. (pq). Since fixing a 
pair of the length differences to zero still leaves all the 
possibilities to the W — 1 remaining pairs, it is 



Pr{ A(r, t-l) = A(r-l,t-l) = 0} = 



W-l 



(29) 



Inserting this into Eq. (28) yields the full finite size de- 
pendence of the first passage percolation version of the 
LCS problem in terms of Legendre polynomials. Expand- 
ing these Legendre polynomials for large degrees W finally 
yields the first order terms 



a c (W) 



/c-1 1 



O 



1 



(30) 



of this exact finite size correction. 

This can be directly compared to the numerical results 
obtained by the multi-spin coding algorithm described 
above. To this end we choose 1,000 random initial condi- 
tions of the A(r,t = 0) (that fulfil Eq. (|l5|)) and the same 
number of random configurations of the f?(r, t) . Then, we 
proceed as described at the end of Sec. |3|, i.e., we apply 
the recursion equations (jllj) and ( pp| ) a large number of 
times, evaluate Eq. (Ell) in every time step, and gener- 
ate estimates of a c (PFjaccording to Eq. ( |l8| ) by averaging 
over many time steps and the different configurations of 
the disorder. 

An important question is how close these estimates of 
a c (W) come to its true value. In answering this question, 
we have to consider statistical and systematic deviations. 
The statistical deviations can be measured by looking at 
the sample-to-sample fluctuations between the different 
configurations of the disorder r}(r, t) and made as small 
as we wish by increasing the number of disorder config- 
urations and the number of recursion steps we average 
over. The systematic deviations have to be studied more 
carefully. They arise because the Markov dynamics of the 
differences A(r,t) reaches its stationary state only in the 
limit of infinitely many recursion steps if we start from a 
random initial state. For a finite number t\ of recursions 
steps there is a systematic correction that decays expo- 
nentially in t\ with some decay "time" to- 

In order to verify if this exponential approach to the 
stationary state plays a role in our numerics, we iterate 
Eqs. (19) and (|2C| ) 10 7 times for each configuration of 
the disorder rj(r, t) and calculate 10 different estimates of 
a c (W) by averaging in Eq. (pf) over t e]*i,ti + 10 6 ] for 
ti £ {0, 10 6 , 2 • 10 6 , . . . , 9 • 1(F}. If there is a systematic 
deviation this should manifest itself in a systematic de- 
pendence of these estimates on the number t\ of recursion 
steps performed before the averaging starts. 

Since the approach to the stationary state has to be- 
come the slower the wider the lattice we have to study 
this effect for the largest width W= 4096 we plan on us- 
ing in our numerics. Fig. ^(a) shows the ten estimates of 
a,2(W = 4096) which are obtained as described above. 
Clearly, there is no noticeable systematic dependence on 
the number t\ of recursion steps. All fluctuations of the 
individual points are within the statistical uncertainties of 
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0.8127 



0.82840 




2e+06 4e+06 6e+06 8e+06 le+07 
(a) l i 



0.8126 



2e+06 4e+06 6e+06 8e+06 le+07 
(b) l l 



Fig. 2. Dependence of estimates of the Chvatal-Sankoff con- 
stant on the number ti of recursion steps performed before 
the averaging starts, (a) shows estimates of 02 (W) for com- 
pletely uncorrelated disorder, i.e., for the first-passage perco- 
lation problem. They do coincide within their statistical fluc- 
tuations for all ti. The dashed line shows the average over all 
points but the first, (b) shows estimates of a,2(W) for the full 
LCS problem including the disorder correlations. They have 
a larger statistical fluctuation than in the first passage perco- 
lation case and the first point deviates significantly from the 
others. Both effects stem from the much slower approach to the 
stationary state in the presence of the correlations. However, 
there is no apparent dependence on ti for ti > 10 6 in this case 
either. The dashed line shows the average over the last eight 
data points. 
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Fig. 3. Numerical estimates of the Chvatal-Sankoff constant 
for a two letter alphabet as a function of 1/W. The solid lines 
are in both cases best fits to a linear function, (a) shows values 
a^{W) for completely uncorrelated disorder, i.e., for the first- 
passage percolation problem. The dashed line indicates the ex- 
act value for infinite width, while the dotted line includes the 
predicted finite size correction. It is nearly indistinguishable 
from the best fit to a linear function (solid line), (b) shows 
the values a,2(W) for the true longest common subsequence 
problem including the disorder correlations. Here, the dashed 
and dotted lines show the upper and lower bounds for the in- 
finite system given by Danfifk jTl| and Boutet de Monvel 
|l3| respectively. 
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each other. While the statistical uncertainties of the indi- 
vidual points are already on the order of ±10~ 6 we can 
further improve them by averaging over all the points. To 
be absolutely certain that the approach to the stationary 
state does not influence our result, we take this average 
only over the last nine points, i.e., we average over all 
t G]10 6 jl0 7 ]. This average is indicated as the dashed line 
in Fig. |(a). 

After we have convinced ourselves, that we can esti- 
mate the a c (W) to within a precision of ±10 -6 for all W, 
we can study the dependence of a c (W) on the width W 
of the lattice. Fig. ||(a) shows these numerically obtained 
values ofa,2(W) for an alphabet of size c = 2 as a function 
of 1/W. The points clearly fall on a straight line over the 
whole range of lattice widths from W = 128 to W — 4096. 
A simple least square fit to a straight line yields 

a 2 (W) « (0.8284270 ± 0.0000009) - (0.0848 ± 0.0002)-^. 

(31) 

This has to be compared to the exact results 

^J— w 0.82842712 and « 0.08579. 

V2+1 2(V2 + 1) 

We conclude that the value for infinite system size is ex- 
tremely well reproduced by taking into account the 1/W 
finite size correction. Even the prefactor of the finite size 
correction can be extracted with a reasonably high preci- 
sion, although the error of the fit for the finite size coef- 
ficient as given in Eq. (|Tj) underestimates the true error. 
This is to be expected since the finite size coefficient itself 
is subject to further finite size corrections which yield an 
additional systematic error in its estimation. 



In order to get a good numerical estimate of the real 
Chvatal-Sankoff constant, we perform the same simula- 
tions for the subtly correlated disorder r](r,t) generated 
from pairs of randomly chosen sequences. In the presence 
of these disorder correlations, the dynamics of the A(r, t) 
at a given width W can still be interpreted as a Markov 
process. However, the applicable state space comprises at 
a given t the A(r, t) for r € {0, . . . , 2W — 1} and the last 
W letters chosen in each of the two sequences. It is much 
larger than in the first passage percolation version. Thus, 
although the dynamics still converges exponentially to the 
stationary state, the initial phase can be much longer than 
in the first passage percolation case and more care has to 
be taken to minimize the systematic error of numerical 
estimates introduced by this effect. 

Again, we collect estimates from 1, 000 pairs of random 
sequences and apply the recursion equations (|l9| ) and (|2C| ) 
10 times for each sequence pair. Then, we generate the 
same 10 estimates of ai (W = 4096) as in Sec. ^ for the 
largest width W = 4096 we plan to use in our simulations. 
Fig. ||(b) shows the dependence of these estimates on the 
number t\ of recursion steps performed before starting the 
averaging. We note two important differences to the first 
passage percolation case: (i) The estimate with t\ = is 
significantly different from the estimates with t\ > 10 6 and 
(ii) the individual statistical fluctuations are larger than 
in the first passage percolation case (note the difference 
in scale between Figs. ||(a) and (b).) Both findings are to 
be expected due to the slower approach to the stationary 
state. Since the first estimate is different from the other 
nine we conclude that the characteristic number of steps to 
necessary to approach the stationary state is smaller than 
but comparable to 10 6 . This slow decorrelation property 
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also renders the average over many recursion steps less ef- 
fective yielding the larger statistical fluctuations which we 
observe. These statistical fluctuations are of the order of 
±3.5 • 10~ 6 . Since the systematic deviation of the estimate 
for t\ = 10 6 is already smaller than that and since the 
characteristics number of steps to is smaller than 10 6 , we 
can be sure that the systematic deviations of all points 
with fi > 2 ■ 10 6 are less than 10 -6 due to the exponen- 
tial decay of these deviations. Then, averaging in Eq. ( |l8|) 
over all t s]2 • 10 6 , 10 7 ] yields estimates of a c (W) with a 
statistical and systematic error of ±10~ 6 . 

The results of this procedure for all finite system sizes 
W are shown in Fig. ||(b). In addition to the larger statisti- 
cal error bars of the individual points slight corrections to 
the 1/W behavior are visible for W < 160. Nevertheless, 
we get a good fit to a straight line in the regime W > 160 
with the result 



a 2 (W) « (0.812653 ± 0.000004) - (0.0520 ± 0.0012) 



1 

w 

(32) 

Varying the fitting range towards larger W changes the 
infinite W value of the measured Chvatal-Sankoff con- 
stant only within the quoted error bars. This result is 
significantly larger than the results of Dancfk jll] and of 
Boutet de Monvel (lj,[n| obtained for 4 lattices of size 
up to 10 6 x 10 6 and 10, 000 lattices of size 10 4 x 10 4 re- 
spectively. As shown in Fig. ||(b) the value of the finite 
width Chvatal-Sankoff constant already significantly ex- 
ceeds their estimates of the infinite system size value at 
moderate W ss 256. 

In the same way we get high precision values for the 
Chvatal-Sankoff constant for larger alphabet sizes c. How- 
ever, the finite size corrections for larger alphabet size are 
larger and fits to a linear law only settle to within the 
statistical error estimate for W > 352. The fitted values 
are summarized in Tab. |l|. While our method produces 
estimates with a much higher precision than the tradi- 
tional method due to the fact that we know the form of 
the finite size dependence, all our estimates for c > 4 are 
consistent with Boutet de Monvel's within his variations 
between different assumed finite size corrections [[l2|. We 
observe that the differences between the Chvatal-Sankoff 
constant a c and its counterpart a c for the uncorrelated 
disorder are significant but become smaller with increas- 
ing alphabet size. This is to be expected intuitively since 
the correlations between the rj{r, t) become weaker with 
increasing alphabet size (if we know the letter in one se- 
quence and the information that we have a mismatch this 
fixes the letter in the other sequence only up to c — 1 
possible choices.) This dependence on the alphabet size is 
also consistent with the fact that it apparently becomes 
more difficult to take the correlations into account numer- 
ically and to get good numerical estimates of the a c as the 
alphabet size decreases. This materializes in the discrep- 
ancies between our high precision results for a 2 and earlier 
estimates of the Chvatal-Sankoff constant which are not 
present for the larger alphabet sizes. 



c 


a c 


b c 


a c 


2 


0.812653(4) 


0.052(1) 


0.828427 


4 


0.654361(2) 


0.122(1) 


0.666667 


8 


0.515143(4) 


0.197(2) 


0.522408 


16 


0.396316(2) 


0.268(1) 


0.4 



Table 1. Numerical values for the Chvatal-Sankoff constant 
o c and its finite size correction b c for different alphabet sizes 
c. The errors in parentheses are statistical errors on the last 
digit. For comparison the counterpart a c = 2/(^/c+ 1) of the 
Chvatal-Sankoff constant for first-passage percolation is also 
given. 



6 Conclusions 

We conclude that in order to get numerical values of the 
Chvatal-Sankoff constant with a high precision a proper 
handling of the finite size effects is absolutely necessary. 
We showed how concepts from statistical physics can help 
to tackle this problem. We moreover presented a highly 
effective multi-spin coding scheme which further increases 
the precision of the simulations. The numerical results 
given will be valuable in guiding and evaluating theoretical 
approaches to understand the small but noticeable effect 
of the subtle disorder correlations which are responsible 
for the difference between the Chvatal-Sankoff constant 
and its counterpart in the first-passage percolation prob- 
lem. 
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A Change of lattice 

In this appendix we will argue that the limiting quantity 
a c (oo) = lim a c (W) = lim lim -(£(W - 1, 2*)) (33) 

W^oc W^oo t— too t 



defined on the rectangular lattice shown in Fig. |T 
the Chvatal-Sankoff constant 



lim 



(L(N)) 
N 



equals 



(34) 



defined on the diamond shaped lattice shown in Fig. 0(a). 
To this end, we first consider a path from the left edge of 
the rectangular lattice to the point (W— 1, 2t) as indicated 
in Fig. |](a) which is the optimal path in each of the dia- 
monds. Clearly, i(W — 1, 2t) can only be larger than the 
number of matches along this specific path which implies 



(£(W-l,2t))> w (L(W)) 



(35) 
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(a) 
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Fig. 4. Relation between optimal paths on the rectangular and 
the diamond shaped lattice, (a) shows how a good path for the 
rectangular lattice can be constructed by stringing together 
optimal paths on diamond shaped lattices of size W x W . In (b) 
a good path for a diamond shaped lattice of size (t + W) X (t + 
W) is constructed from the optimal path on the rectangular 
lattice. The left end of the path is connected to the tip of 
the diamond which can only increase the number of matches. 
For each time the optimal path wraps around the periodic 
boundary conditions the path has to be modified as shown 
by the diagonal dashed line. For each such event, at most W 
matches along the solid path can be missed. 



or after dividing both sides by t and taking the limit of 
t — > oo 

a c {W) > — yy—. (36) 
Taking the limit W — * oo of this inequality yields 

a c (oo) > a c . (37) 



On the other hand, let us consider the best path from 
the left edge of the rectangular lattice to the point (W — 
1, 2t\ i.e., the path with £(W—1, 2t) matches. As shown in 
Fig. 0(b), we can construct a path which connects the two 
end points in a diamond shaped lattice of size (t + W) x 
it + W) from this path. Since the length of the longest 
common subsequence of the two associated sequences of 
length t + W can only be longer than the common subse- 
quence corresponding to this path we get 

(L(t + W)) > (£(W - 1, 2t)) - W{n m&p {t)), (38) 

where n mSjV {t) is the number of times that the best path 
from the left edge to iW — 1, 2t) "wraps around" the pe- 
riodic boundary conditions. Dividing by t yields 



t + W (L(t + W)) {£(W 



l,2t)> 



t t + W 
which in the limit t 



oo leads to 



-W- 



^wrap(^)) 



a c > a c (W) 



W Km 

t— >oo 



{^wrap (^)) 



(39) 



(40) 



From the analogy with the directed polymer in a random 
medium it is well known that the displacement of a typi- 
cal optimal path of length t scales like t 2 / 3 pRBOlgfl,Bg]. 



Thus, we expect one wrapping around every W 3 ! 2 steps 
along the lattice, i.e., 

("wrapt*)) ~ W- 3 ' 2 t. (41) 

Therefore, if we take the limit W — > oo in Eq. (^) the 
second term on the right hand side vanishes and we are 
left with 

a c > a c (oo). (42) 

Together with Eq. (|37|) this establishes the equality be- 
tween the infinite width limit of the a c (W) and the Chva- 
tal-Sankoff constant a r . 
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